ABSTRACT

Regular path queries (RPQs) select nodes connected by some 
path in a graph. The edge labels of such a path have to form 
a word that matches a given regular expression. We inves- 
tigate the evaluation of RPQs with an additional constraint 
that prevents multiple traversals of the same nodes. Those 
regular simple path queries (RSPQs) find several applica- 
tions in practice, yet they quickly become intractable, even 
for basic languages such as (aa)* or a*ba*. 

In this paper, we establish a comprehensive classification 
of regular languages with respect to the complexity of the 
corresponding regular simple path query problem. More pre- 
cisely, we identify the fragment that is maximal in the fol- 
lowing sense: regular simple path queries can be evaluated in 
polynomial time for every regular language L that belongs to 
this fragment and evaluation is NP-complete for languages 
outside this fragment. We thus fully characterize the frontier 
between tractability and intractability for RSPQs, and we 
refine our results to show the following trichotomy: evaluation
of RSPQs is either in AC^0, NL-complete or NP-complete
in data complexity, depending on the regular language L.
The fragment identified also admits a simple characteriza- 
tion in terms of regular expressions. 

Finally, we also discuss the complexity of the following 
decision problem: decide, given a language L, whether find- 
ing a regular simple path for L is tractable. We consider sev- 
eral alternative representations of L: DFAs, NFAs or regu- 
lar expressions, and prove that this problem is NL-complete 
for the first representation and PSPACE-complete for the 
other two. As a conclusion we extend our results from edge- 
labeled graphs to vertex-labeled graphs and vertex-edge la- 
beled graphs. 

1. INTRODUCTION 

The reachability problem for graphs (finding a path 
between two nodes) represents one of the oldest prob- 
lems in computer science for which very efficient algo- 
rithms have been conceived. However, for many real- 
world problems, constraints on the path need to be con- 
sidered and, as a consequence, the reachability problem 
can become computationally hard. Problems on regular 
paths are among the most studied class of constrained 



path problems. In these problems the edge labels along 
the path must form a word belonging to a given regu- 
lar language. For graph databases, such problems have 
been considered in the context of regular path queries 
(RPQs). Given a language L and two vertices in a 
database graph, a regular path query selects pairs of 
nodes connected by a path whose edge labels form a 
word in L. Graph databases and RPQs have been investigated
since the late 80s [1, 5, 8, 9, 10, 13, 14, 18, 27, 29],
and are now again in vogue due to their wide range of
application scenarios, e.g. in social networks [35], biological
and scientific databases [26, 32], and the Semantic
Web [17]. Regular path queries may traverse the
same nodes multiple times, whereas regular simple path
queries (RSPQs) may traverse each vertex only
once. From a theoretical viewpoint, the former notion
has overshadowed the latter, mainly for complexity reasons.
Indeed, RPQs are computable in time polynomial
in both query and data complexity (combined complexity),
while the evaluation of RSPQs is NP-complete even
for fixed basic languages such as (aa)* or a*ba*.

RSPQs, however, are desired in many application scenarios
[26, 32, 6, 24, 22, 39], such as transportation
problems, VLSI design, metabolic networks, DNA matching
and routing in wireless networks. As a further example,
the problem of finding subgraphs matching a graph
pattern can be generalized to use regular expressions
on pattern edges [14]. Such queries may also enforce
the condition that their matched vertices are distinct.
Additionally, regular simple paths have recently been
considered in SPARQL 1.1 queries exhibiting property
paths. In particular, recent studies on the complexity of
property paths in SPARQL [3, 28] have highlighted the
hardness of the semantics proposed by W3C to evaluate
such paths in RDF graphs. Roughly speaking, according
to the semantics considered in [28], the evaluation of
expressions under Kleene-star closure imposes that the
involved path is simple, whereas the evaluation of the
remaining expressions allows the same node to be traversed
multiple times. As such, the semantics studied in [28]
is a hybrid between regular path and regular simple
path semantics.

Contributions. In this paper, we address the long-standing
open question [29, 6] of characterizing exactly the
maximal class of regular languages for which RSPQs are
tractable. By "tractable" we mean computable in time
polynomial in the size of the database. Precisely, we
establish a comprehensive classification of the complexity
of RSPQs for a fixed regular language L: given an edge-labeled
graph G and two vertices x and y, is there a
simple path from x to y whose edge labels form a word
of L? A first step towards this important issue has been
made in [29], which exhibits a tractable fragment: the
class of languages closed under subword. However, this
fragment is not maximal.

Our contributions can be detailed as follows. We
introduce a class of languages, named trC, for which
RSPQs are computable in polynomial time, and even
in NL. We then show that this fragment is maximal,
as the RSPQ problem is NP-complete for every regular
language that does not belong to trC. Consequently,
we characterize, under the hypothesis NL ≠ NP (which
is actually weaker than PTIME ≠ NP), the frontier between
tractability and intractability for this problem.
Additionally, trC also represents the maximal class for
which finding a shortest path that satisfies an RSPQ is
tractable. We note that we focus on data complexity, as
we assume that the language L is fixed. At this point,
the classification of the languages is not
yet complete. Therefore, we refine our results to show
the following trichotomy: the RSPQ problem is either
in AC^0, NL-complete or NP-complete.

We also discuss the complexity of deciding, given a language
L, whether the RSPQ problem for L is tractable.
We consider several alternative representations of L:
DFAs, NFAs or regular expressions. We prove that this
problem is NL-complete for the first representation and
PSPACE-complete for the other two.

Next, we give a characterization of the tractable fragment
trC for edge-labeled graphs in terms of regular
expressions. Moreover, trC is closed under union and
intersection, and languages in trC are aperiodic, i.e. can
be expressed by first-order formulas [37].

The above results hold for the common definition of
database graphs, i.e. edge-labeled graphs. However, it
seems natural to also consider queries
on vertex labels and queries on both vertex
and edge labels. As an example, a Google Maps
user may wish to specify a regular
expression that enforces a stopover in a given city
and avoids another city while preferring certain types
of roads. For this reason, we consider two other
models: vertex-labeled graphs and vertex-edge-labeled
graphs (where both vertices and edges are labeled).
Surprisingly, for some languages, the RSPQ problem is
simpler on vertex-labeled graphs than on edge-labeled
graphs. With L = (ab)* for instance, RSPQ is polynomial
for vertex-labeled graphs and NP-complete for
edge-labeled graphs. Vertex-edge-labeled graphs obviously
generalize both edge-labeled graphs and vertex-labeled
graphs. Furthermore, we can adapt our results
to prove, for these two models, a classification of the
same kind as the one shown for edge-labeled graphs:
the RSPQ problem is either in AC^0, NL-complete or
NP-complete.

As a final contribution, we have obtained two minor
results. First, we have attempted to study the
parameterized complexity of tractable RSPQs
when the parameter is the size of the query. However,
we only obtained a partial result: we prove that the problem
is FPT for the class of finite languages. Moreover,
we prove that the problem is also FPT for the class of
all regular languages when the parameter is the size of
the path. As a second result, we prove that the problem
RSPQ is polynomial w.r.t. combined complexity on
graphs of bounded directed treewidth. This is actually
a straightforward generalization of a result of [23].

Related Work. Regular path queries express ways to
evaluate regular expression patterns on database graph
models [1, 5, 8, 9, 10, 13, 14, 18, 27, 29] or tree-structured
models, such as XML [11]. While the regular path problem
has been extensively studied in the literature, the
regular simple path problem has received less attention 
in both the database and graph communities. Besides 
the works on regular paths, there have been studies on 
finding paths with some constraints. In particular, Lapaugh
et al. [25] prove that finding simple paths of even
length is polynomial for undirected graphs and NP-complete
for directed graphs. This study has been extended
in [2] by considering paths of length i mod k.
Similarly, finding k disjoint paths with extremities given
as input is polynomial for undirected graphs [34] and
NP-complete for directed graphs [16]. Mendelzon and
Wood [29] show that the regular simple path problem is
NP-complete in the general case. However, they show 
that the problem can be decided in polynomial time 
for subword-closed languages. They also show that the 
problem becomes polynomial under some restrictions 
on the size of cycles of both graph and automaton. A
subsequent paper [31] proves the polynomiality for the
class of outerplanar graphs. Barrett et al. [6] extend this
result, proving that the regular simple path problem is 
polynomial w.r.t. combined complexity for graphs of 
bounded treewidth. Let us also observe that the ex- 
istence of a regular simple path between two vertices 
is MSO-definable, and therefore a well-known result of
Courcelle p] already implies the same result but w.r.t. 
data complexity only. Barrett et al. [6] also show that 
the problem is NP-complete for the class of grid graphs 
even when the language is fixed. Practical algorithms
for regular simple paths on large graphs have been proposed
in [22].

Regular simple paths have also been investigated in
the context of SPARQL property paths with the semantics
proposed in a working draft of SPARQL 1.1. Notice
that this semantics of SPARQL property paths does not
exactly correspond to regular simple path queries. Losemann
and Martens [28] and Arenas et al. [3] investigate
the complexity of evaluating such property paths.
They show that the evaluation is NP-complete in sev- 
eral cases, along with exhibiting cases in which it is 
polynomial. More precisely, Losemann and Martens 
consider different fragments of regular expressions and 
classify them with respect to the complexity of evalu- 
ating SPARQL property paths. Both papers also show 
that counting the number of paths that match a regular 
expression (which is permitted by the working draft) is 
hard in many cases. 

2. PRELIMINARIES 

For the rest of the paper, $\Sigma$ always denotes a finite
alphabet. We use the notation [n] to denote the set of
integers {1, ..., n}. Given a word w and a language L,
$w^{-1}L = \{w' : ww' \in L\}$.

Complexity: NL, P, NP, PSPACE refer to the classical
complexity classes [33]. The reductions we consider
are many-to-one logspace reductions [33], and completeness
of problems is under this type of reduction.
The class AC^0 refers to uniform AC^0, which is equivalent
to FO(+, ×) or FO(BIT, <) [21]. For the definition of
FPT, see [15].

Graphs: In this paper, we mainly consider db-graphs,
although we also consider vertex-labeled graphs and
evl-graphs (graphs where both vertices and edges are
labeled). A db-graph is a tuple $G = (V, \Sigma, E)$ where V
is a set of vertices, $\Sigma$ is a set of labels and $E \subseteq V \times \Sigma \times V$
is a set of edges labeled by symbols of $\Sigma$. A path
p of a db-graph G from x to y is a sequence $(v_1 =
x, a_1, \ldots, v_k, a_k, v_{k+1} = y)$ such that for each $i \in [k + 1]$, $v_i$
is a vertex in G and for each $i \in [k]$, $(v_i, a_i, v_{i+1})$ is an
edge in G. A path p is simple if all vertices $v_i$ in p are
distinct. Given a language $L \subseteq \Sigma^*$, p is an L-labeled
path if $a_1 \ldots a_k \in L$.

Automata: Let L be a regular language. We denote
by $A_L = (Q_L, i_L, F_L, \Delta_L)$ the minimal DFA for L, and
by M the number of states $M = |Q_L|$ in $A_L$. We assume
that $A_L$ is complete, i.e. $\Delta_L$ is a total function,
so that in general $A_L$ may have a sink state. For any
$q \in Q_L$, $w \in \Sigma^*$, $\Delta_L(q, w)$ denotes the state obtained
when reading w from q. Finally, $\mathcal{L}_q$ denotes the set
of all words accepted from q. For every state q we denote
by Loop(q) the set of all non-empty words that
allow to loop on q: $Loop(q) = \{w \in \Sigma^+ \mid \Delta(q, w) = q\}$.
Strongly connected components of (the graph of) $A_L$
are simply called components. There are many definitions
of aperiodic languages [37]. We use the following.
A language L is aperiodic if it is regular and
its minimal automaton $A_L$ satisfies the following property
for every state $q \in Q_L$, integer $k \ge 1$ and word
$w \in \Sigma^*$: $\Delta_L(q, w^k) = q$ implies $\Delta(q, w) = q$. As a consequence,
for every state $q \in Q_L$ and word $w \in \Sigma^*$,
$\Delta_L(q, w^{M+1}) = \Delta_L(q, w^M)$.
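
The aperiodicity condition above can be tested mechanically on a small complete DFA. The following Python sketch is not taken from the paper: the function names and the dictionary encoding of the transition function are assumptions made for illustration. It enumerates the transformations of the state set induced by non-empty words and checks $t^M = t^{M+1}$ for each of them. The number of transformations can be exponential in M, so this is only meant for toy automata.

```python
def transition_monoid(num_states, alphabet, delta):
    """Transformations of {0, ..., num_states-1} induced by non-empty words.

    delta is assumed to be a dict mapping (state, letter) to a state and to
    describe a complete DFA (hypothetical encoding, chosen for this sketch).
    """
    gens = [tuple(delta[(q, a)] for q in range(num_states)) for a in alphabet]
    seen = set(gens)
    frontier = list(seen)
    while frontier:
        t = frontier.pop()
        for g in gens:
            # Read the word of t first, then one more letter (g).
            tg = tuple(g[t[q]] for q in range(num_states))
            if tg not in seen:
                seen.add(tg)
                frontier.append(tg)
    return seen


def power(t, n):
    """n-fold composition of the transformation t with itself."""
    result = tuple(range(len(t)))
    for _ in range(n):
        result = tuple(t[result[q]] for q in range(len(t)))
    return result


def is_aperiodic(num_states, alphabet, delta):
    """Check Delta(q, w^M) = Delta(q, w^(M+1)) for all states q and words w."""
    m = num_states
    return all(power(t, m) == power(t, m + 1)
               for t in transition_monoid(num_states, alphabet, delta))
```

On the two-state automaton of (aa)* the check fails, whereas it succeeds on the three-state automaton (with sink) of a*ba*; the property only speaks about the language when the given DFA is the minimal one.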

RSPQ: Given a class $\mathcal{L}$ of regular languages and a
class $\mathcal{G}$ of db-graphs, we define the following problem:

RSPQ($\mathcal{L}$, $\mathcal{G}$)

Input: a language $L \in \mathcal{L}$,
a db-graph $G = (V, \Sigma, E) \in \mathcal{G}$,
and two vertices $x, y \in V$

Question: is there a simple L-labeled path from x
to y?

The encoding of the language L will be specified when
required. We denote by "All" the class of all db-graphs;
RSPQ($\mathcal{L}$) means RSPQ($\mathcal{L}$, All). For a fixed language
L, we use RSPQ(L, $\mathcal{G}$) to denote RSPQ({L}, $\mathcal{G}$). Since
L is fixed, we focus on data complexity. Notice that 
the representation of L does not matter here. Although 
we consider the boolean version of the problem, namely 
deciding the existence of a path, our algorithms actually 
also return a (shortest) simple L-labeled path. 
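
To make the problem statement concrete, here is a naive Python baseline, offered as a sketch and not as the algorithm developed in this paper. It explores simple paths by backtracking while running a DFA for L, so it takes exponential time in the worst case. The function name, the edge-triple encoding of db-graphs and the dictionary encoding of the DFA are assumptions made for this illustration; missing transitions are treated as rejecting.

```python
def rspq_bruteforce(edges, delta, initial, accepting, x, y):
    """Return a simple L-labeled path [v1, a1, v2, ..., y] from x to y, or None.

    edges: iterable of (source, label, target) triples (hypothetical encoding).
    delta: dict (state, label) -> state of a DFA for L; absent keys reject.
    """
    out = {}
    for (u, a, v) in edges:
        out.setdefault(u, []).append((a, v))

    def dfs(v, state, visited, path):
        if v == y and state in accepting:
            return list(path)
        for a, w in out.get(v, []):
            # Simplicity: never enter a vertex that is already on the path.
            if w in visited or (state, a) not in delta:
                continue
            visited.add(w)
            path.extend([a, w])
            found = dfs(w, delta[(state, a)], visited, path)
            if found is not None:
                return found
            del path[-2:]
            visited.remove(w)
        return None

    return dfs(x, initial, {x}, [x])
```

For instance, on the graph with edges x -a-> u, u -a-> u, u -a-> y, the query L = aaa has a matching path (x, u, u, y) but no simple one, so the function returns None, while L = aa is answered by the simple path (x, u, y).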

Given a regular language L, our main question is to
give a criterion to decide whether RSPQ(L) is tractable
(i.e. decidable in polynomial time) or not (i.e. NP-complete).
We address this question in the following
sections.

Example 1. As an introductory example, consider
the language $L = a^*(bb^+ + \varepsilon)c^*$. We wish to decide
whether there exists a simple path from x to y labeled by
L, given two vertices x, y of a db-graph G. It is not
obvious that this problem can be solved efficiently:
the problem has indeed been proved NP-complete for the
language a*bc*. Yet we can give a polynomial algorithm
for L.

First, we check whether y can be reached from x by
a (not necessarily simple) path labeled in a*c*. In that
case we can obtain a simple such path, since the path
obtained after eliminating the loops is still labeled in
a*c*.

Assume now that no a*c*-labeled path can connect
x to y. Then we can similarly check if there exists a
simple $a^* b^k c^*$-labeled path from x to y for $k \in \{2, 3\}$:
we check one after another each possible assignment for
the k middle b-labeled edges: if the initial b-labeled edge
can be reached from x via an a*-labeled path (avoiding
the vertices of the b-edges) and if the final b-labeled edge
can reach y through a c*-labeled path, then we obtain a
simple $a^* b^k c^*$-labeled path from x to y. This is because
we assumed there is no a*c*-labeled path from x to y, so
the a- and c-labeled edges cannot intersect. Observe that
the number of possible assignments for k edges ($k \le 3$)
is polynomial.

Let us now assume w.l.o.g. that there is no $a^* b^k c^*$-labeled
path from x to y for $k \in \{0, 2, 3\}$. We can show
that there exists a simple L-labeled path from x to y if
and only if there exist six nodes $v_1, v_2, v_3, v_4, v_5, v_6$, all
distinct except that $v_3$ may equal $v_4$, and for which we
can find simultaneously:

• a b-labeled edge from $v_1$ to $v_2$, from $v_2$ to $v_3$, from
$v_4$ to $v_5$, and from $v_5$ to $v_6$,

• an a*-labeled path from x to $v_1$ avoiding all other
$v_i$ (i > 1),

• a b*-labeled path from $v_3$ to $v_4$ of which all nodes
(but the first and last) avoid $S_a$,

• a c*-labeled path from $v_6$ to y of which all nodes
(but the first) avoid $S_a$ and $S_b$,

where the set $S_a$ contains exactly all $v_i$ plus all positions
reachable from x by some a*-labeled path avoiding all $v_i$,
and the set $S_b$ contains exactly all $v_i$ plus all positions
reachable from $v_3$ through a b*-labeled path that avoids
all nodes of $S_a$. The diagram below summarizes all these
conditions.

x --a*--> v_1 --b--> v_2 --b--> v_3 --b*--> v_4 --b--> v_5 --b--> v_6 --c*--> y

(Annotations: the a*-segment is marked $S_a$; the b*-segment is marked $S_b$ and lies in $V \setminus S_a$; the c*-segment lies in $V \setminus (S_a \cup S_b)$.)

These conditions can clearly be verified in time polyno- 
mial in G. We develop in this paper the general idea 
underlying this argument, which allows us to charac- 
terize tractable instances for the problem of finding a 
simple path. 
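
The first step of Example 1 (an a*c*-labeled path exists if and only if a simple one does) can be sketched in Python as follows. This is an illustration under assumptions, not the paper's algorithm: `regular_path` performs a breadth-first search on the product of the graph with a DFA, and `eliminate_loops` cuts cycles out of a path; both names and the list encoding of paths are hypothetical. Loop elimination only removes letters, so the label stays in a*c* because that language is closed under subwords.

```python
from collections import deque


def regular_path(edges, delta, initial, accepting, x, y):
    """Shortest, not necessarily simple, accepted path [v1, a1, v2, ...] or None."""
    out = {}
    for (u, a, v) in edges:
        out.setdefault(u, []).append((a, v))
    start = (x, initial)
    parent = {start: None}
    queue = deque([start])
    while queue:
        v, q = queue.popleft()
        if v == y and q in accepting:
            # Rebuild the path by following parent pointers backwards.
            path, node = [v], (v, q)
            while parent[node] is not None:
                prev, a = parent[node]
                path[:0] = [prev[0], a]
                node = prev
            return path
        for a, w in out.get(v, []):
            if (q, a) in delta:
                nxt = (w, delta[(q, a)])
                if nxt not in parent:
                    parent[nxt] = ((v, q), a)
                    queue.append(nxt)
    return None


def eliminate_loops(path):
    """Remove every cycle: from each vertex, jump to its last occurrence."""
    vertices, labels = path[0::2], path[1::2]
    simple, i = [vertices[0]], 0
    while i < len(vertices) - 1:
        i = len(vertices) - 1 - vertices[::-1].index(vertices[i])
        if i < len(vertices) - 1:
            simple += [labels[i], vertices[i + 1]]
            i += 1
    return simple
```

With the DFA `{(0,'a'): 0, (0,'c'): 1, (1,'c'): 1}` for a*c* (both states accepting), the non-simple path x, a, u, c, x, c, y is reduced to x, c, y, whose label c is still in a*c*.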

3. A TRICHOTOMY FOR RSPQ 

We next define a class of languages. We will prove 
that it is exactly the class of regular languages for which 
RSPQ(L) is tractable. 

Definition 1. For each $i \ge 0$, we define trC(i) as
the class of regular languages L such that for all words
$w_l, w_m, w_r \in \Sigma^*$ and all non-empty words $w_1, w_2 \in \Sigma^+$,
if $w_l w_1^i w_m w_2^i w_r \in L$ then $w_l w_1^i w_2^i w_r \in L$.

We define the class trC as the union $\bigcup_{i \ge 0}$ trC(i).
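
As a sanity check of Definition 1, the two languages mentioned in the introduction, a*ba* and (aa)*, can be shown to lie outside trC(i) for every i. The witnesses below are chosen for this illustration and are not taken from the paper.

```latex
% L = a^* b a^*: take w_l = w_r = \varepsilon, w_1 = w_2 = a, w_m = b.
\[
  a^i \, b \, a^i \in a^* b a^*
  \quad\text{but}\quad
  a^i a^i = a^{2i} \notin a^* b a^* .
\]
% L = (aa)^*: take w_l = \varepsilon, w_1 = w_2 = aa, w_m = a, w_r = a.
\[
  (aa)^i \, a \, (aa)^i \, a = a^{4i+2} \in (aa)^*
  \quad\text{but}\quad
  (aa)^i (aa)^i a = a^{4i+1} \notin (aa)^* .
\]
```

Both computations hold for every $i \ge 0$, which agrees with the intractability of these two languages stated earlier.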

Lemma 1. trC is closed under intersection, union and
word reversal.

This result is trivial by definition. The next lemma
states that we need not consider trC(i) for i > M.

Lemma 2. The following two statements hold:

• For every i, trC(i) $\subseteq$ trC(i + 1).

• Let L be a regular language and M the size of $Q_L$.
Then $L \in$ trC iff $L \in$ trC(M).

Proof. 1) Let $L \in$ trC(i) and let w be a word from L
of the form $w = w_l w_1^{i+1} w_m w_2^{i+1} w_r$. Then $w_l w_1^{i+1} w_2^{i+1} w_r$
also belongs to L since w can be decomposed as $w =
(w_l w_1) w_1^i w_m w_2^i (w_2 w_r)$.

2) Let $L \in$ trC and let M be the size of $Q_L$. There
exists $i \ge 0$ such that $L \in$ trC(i). We will prove that
$L \in$ trC(M). The case when $i \le M$ is a consequence of
the previous statement. Assume that i > M. Let $w =
w_l w_1^M w_m w_2^M w_r \in L$. By the Pumping Lemma for finite
automata, there are k, k' > 0 such that for every $j, j' \ge
0$, $w_l w_1^{M+jk} w_m w_2^{M+j'k'} w_r \in L$. Thus $w_l w_1^{M+jk} w_2^{M+j'k'} w_r \in L$
for every $j, j' \ge i$, since $L \in$ trC(i). Applying once
more the pumping lemma gives $w_l w_1^M w_2^M w_r \in L$. □

Here and henceforth, M refers to the size of $Q_L$. We
now give a characterization of trC in terms of automata.

Lemma 3. Let L be a regular language. Then, L belongs
to trC iff for all states $q_1, q_2 \in Q_L$ such that
$Loop(q_1) \ne \emptyset$, $Loop(q_2) \ne \emptyset$ and $q_2 \in \Delta(q_1, \Sigma^*)$, and
for every $w \in Loop(q_2)$, the following statement holds:
$w^M \mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_1}$.

Proof. (only if) Let $q_1, q_2 \in Q_L$, $w_1 \in Loop(q_1)$
and $w_2 \in Loop(q_2)$ such that $q_2 \in \Delta(q_1, \Sigma^*)$. Let
$w_l, w_m, w_r \in \Sigma^*$ such that $\Delta(i_L, w_l) = q_1$, $\Delta(q_1, w_m) =
q_2$ and $w_r \in \mathcal{L}_{q_2}$. Thus, $w_l w_1^M w_m w_2^M w_r \in L$. By
definition of trC (and Lemma 2), $w_l w_1^M w_2^M w_r \in L$ and, consequently,
$w_2^M w_r \in \mathcal{L}_{q_1}$.

(if) Let L be a language that satisfies the right-hand
side of the equivalence. We first prove that L is
aperiodic. Indeed, let q be a state of $A_L$, $w' \in \Sigma^*$ and
$k \ge 1$ such that $\Delta(q, w'^k) = q$. By applying the assumption
with $q_1 = q$, $q_2 = \Delta(q, w')$ and $w = w'^k$, we obtain
$\mathcal{L}_{\Delta(q, w')} \subseteq \mathcal{L}_q$. Symmetrically, with $q_1 = \Delta(q, w')$,
$q_2 = q$ and $w = w'^k$, we obtain $\mathcal{L}_q \subseteq \mathcal{L}_{\Delta(q, w')}$. Thus,
by minimality of $A_L$, $\Delta(q, w') = q$.

Let us now prove $L \in$ trC(2M). Consider some words
$w_l, w_m, w_r, w_1$, and $w_2$ such that $w_l w_1^{2M} w_m w_2^{2M} w_r \in
L$ with $w_1, w_2$ non-empty. Let $q_1 = \Delta(i_L, w_l w_1^{2M})$
and $q_2 = \Delta(i_L, w_l w_1^{2M} w_m w_2^M)$. Then $w_2^M w_r \in \mathcal{L}_{q_2}$
and, since L is aperiodic, $w_1 \in Loop(q_1)$ and $w_2 \in
Loop(q_2)$. By hypothesis we then get $w_2^{2M} w_r \in \mathcal{L}_{q_1}$,
so $w_l w_1^{2M} w_2^{2M} w_r \in L$. □

3.1 Hard languages for RSPQ 

This section is devoted to the proof of a hardness 
result: RSPQ(L) is NP-hard for every regular language 
L that does not belong to trC. The first step toward 
that proof lies in the following characterization of trC. 

Lemma 4. Let L be a regular language. $L \in$ trC iff
L does not satisfy the following property:
(1) there exist a state $q \in Q_L$, some word $w_r \in \Sigma^*$ and
some non-empty words $w_1 \in Loop(q)$ and $w_2, w_m \in \Sigma^+$
such that

• $w_m w_2^* w_r \subseteq \mathcal{L}_q$,

• $(w_1 + w_2)^* w_r \cap \mathcal{L}_q = \emptyset$.

The "only if" direction is trivial. To prove the "if"
implication, we assume that L does not satisfy Property
(1) and show the following two claims.



Claim 1. Let $q_1, q_2 \in Q_L$ such that $q_2 \in \Delta(q_1, \Sigma^*)$.
Then, either $Loop(q_1) \cap Loop(q_2) = \emptyset$ or $\mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_1}$.

Proof of Claim 1. Let $q_1$ and $q_2$ be states that do not satisfy
the claim. Then, we choose $w_1 = w_2 \in Loop(q_1) \cap
Loop(q_2)$ and $w_r \in \mathcal{L}_{q_2} \setminus \mathcal{L}_{q_1}$. Furthermore, we choose
$w_m$ such that $\Delta(q_1, w_m) = q_2$. Thus, Property (1)
holds. □

Claim 2. L is aperiodic.

Proof of Claim 2. Let $q_1 \in Q_L$ and $k \ge 1$ such
that $\Delta_L(q_1, w^k) = q_1$. Let $q_2 = \Delta(q_1, w)$. By Claim 1,
$\mathcal{L}_{q_1} = \mathcal{L}_{q_2}$ and thus, by minimality of $A_L$, $q_1 = q_2$. □

We can now prove Lemma 4.

Proof. Fix a language $L \notin$ trC. By Lemma 3 there
exist $q, q_2, w_1, w_2, w_m, w_r$ such that $w_1 \in Loop(q)$, $w_2 \in
Loop(q_2)$, $\Delta(q, w_m) = q_2$, and $w_2^M \mathcal{L}_{q_2} \setminus \mathcal{L}_q \ne \emptyset$. As
L is aperiodic, $\Delta(q, (w_2)^M) = \Delta(q, ((w_2)^M)^M)$, hence
$((w_2)^M)^M \mathcal{L}_{q_2} \setminus \mathcal{L}_q \ne \emptyset$. W.l.o.g., we can therefore suppose
that $w_2 = (w_2')^M$ for some word $w_2'$, and fix some
word $w_r \in w_2^M \mathcal{L}_{q_2} \setminus \mathcal{L}_q$. Consequently, every state q'
in $\Delta(q, \Sigma^* w_2)$ satisfies $\Delta(q', w_2) = q'$ (as L is aperiodic),
hence $\mathcal{L}_{q'} \subseteq \mathcal{L}_q$ by Claim 1, and consequently
$(w_1 + w_2)^* w_r \cap \mathcal{L}_q$ is a subset of $w_2^* w_r \cap \mathcal{L}_q = \emptyset$. Thus,
Property (1) is satisfied, which concludes the proof of
Lemma 4. □

We can now prove our hardness result, by reduction
from Vertex-Disjoint-Path, a problem that has also been used
to prove hardness in the particular case of a*ba*.

Vertex-Disjoint-Path

Input: A directed graph G = (V, E), four vertices
$x_1, y_1, x_2, y_2 \in V$

Question: are there two disjoint paths, one from $x_1$
to $y_1$ and the other from $x_2$ to $y_2$?



Lemma 5. Let L be a regular language that does not
belong to trC. Then, RSPQ(L) is NP-hard.

Proof. As explained, we construct a reduction from
the Vertex-Disjoint-Path problem. Let $q, w_m, w_r, w_1, w_2$
be defined as in Property (1) and $w_l$ such that $\Delta(i_L, w_l) =
q$. We build from G the db-graph G'.

We consider here db-graphs where edges are labeled
by non-empty words. This is actually a generalization
of db-graphs. Nevertheless, by adding intermediate vertices,
an edge labeled by a word w can be replaced with
a path whose edges form the word w.

G' is constructed as follows. We start from an empty
graph G' whose vertices are the vertices of G. For each
edge $(v_1, v_2)$ in G, we add two edges $(v_1, w_1, v_2)$ and
$(v_1, w_2, v_2)$. Moreover, we add two new vertices x, y
and three edges $(x, w_l, x_1)$, $(y_1, w_m, x_2)$ and $(y_2, w_r, y)$.
Note that every simple path from x to y in G' matches a
word in $w_l(w_1 + w_2)^* w_r$ or $w_l(w_1 + w_2)^* w_m (w_1 + w_2)^* w_r$.

Thus, RSPQ(L) returns True for (G', x, y) iff there
is a simple path from x to y in G' that contains the
edge $(y_1, w_m, x_2)$, that is, iff Vertex-Disjoint-Path returns
True for $(G, x_1, y_1, x_2, y_2)$. We illustrate below
the reduction on an instance $(G, x_1, y_1, x_2, y_2)$ for L =
a*b(cc)*d, choosing $w_l = w_1 = a$, $w_m = b$, $w_2 = cc$, and
$w_r = d$. □

Figure 1: Reduction for L = a*b(cc)*d (left: input instance G; right: RSPQ instance, graph G').
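
The construction of G' in the proof of Lemma 5 can be sketched in Python as follows. This is an illustration under assumptions: the function name, the fresh vertex names "x", "y" and ("aux", n), and the encoding of words as non-empty strings are all hypothetical, and the input graph is assumed not to use those names already. Word-labeled edges are expanded into paths of single-letter edges, as described in the proof.

```python
def build_reduction(edges, x1, y1, x2, y2, w_l, w_1, w_2, w_m, w_r):
    """Return the edge list of G' as (source, letter, target) triples.

    edges: directed edges (u, v) of the Vertex-Disjoint-Path instance G.
    w_l, w_1, w_2, w_m, w_r: non-empty strings playing the roles of the
    words in Property (1); w_l leads from the initial state to q.
    """
    new_edges = []
    counter = [0]

    def add_word_edge(u, word, v):
        # Replace an edge labeled by a word with a path through fresh
        # intermediate vertices, one letter per edge.
        prev = u
        for letter in word[:-1]:
            counter[0] += 1
            mid = ("aux", counter[0])
            new_edges.append((prev, letter, mid))
            prev = mid
        new_edges.append((prev, word[-1], v))

    for (u, v) in edges:
        add_word_edge(u, w_1, v)
        add_word_edge(u, w_2, v)
    add_word_edge("x", w_l, x1)
    add_word_edge(y1, w_m, x2)
    add_word_edge(y2, w_r, "y")
    return new_edges
```

For the language of Figure 1, one would call `build_reduction(G_edges, x1, y1, x2, y2, "a", "a", "cc", "b", "d")`; every edge of G then yields one a-edge and one two-step cc-path.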

3.2 Tractable languages for RSPQ 

The main result of this section is that for every $L \in$
trC, RSPQ(L) $\in$ NL. For this purpose, we first prove
several lemmas on the structure of automata that recognize
trC languages. In this section, we fix a language
$L \in$ trC. Note that L satisfies Claims 1 and 2.

3.2.1 Technical lemmas on the components of $A_L$

Here and thereafter, we fix $N = 2M^2$. The next
lemmas give information on the structure of the components
of $A_L$. The first lemma strengthens Lemma 3.

Lemma 6. Let L be a regular language. Then, L belongs
to trC iff for all states $q_1, q_2 \in Q_L$ such that
$Loop(q_1) \ne \emptyset$, $Loop(q_2) \ne \emptyset$ and $q_2 \in \Delta(q_1, \Sigma^*)$, the
following statement holds: $(Loop(q_2))^M \mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_1}$.

Proof. 2 ⇒ 1 is trivial.
1 ⇒ 2: Let $q_1$ and $q_2$ be two states such that $Loop(q_1) \ne \emptyset$
and $q_2 \in \Delta(q_1, \Sigma^*)$. Let $v_1 \ldots v_M \in (Loop(q_2))^M$ and
$q_3 = \Delta(q_1, v_1 \ldots v_M)$. We wish to prove $\mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_3}$.

For some i, j with $0 \le i < j \le M$, we get $\Delta(q_1, v_1 \ldots v_i) =
\Delta(q_1, v_1 \ldots v_j)$ (using the convention $\Delta(q_1, v_1 \ldots v_i) =
q_1$ for i = 0). Let $u_1 = v_1 \ldots v_i$, $u_2 = v_{i+1} \ldots v_j$ and
$u_3 = v_{j+1} \ldots v_M$. Let $q_4 = \Delta(q_1, u_1)$. Since $\mathcal{L}_{q_2} =
u_3^{-1} \mathcal{L}_{q_2}$ and $\mathcal{L}_{q_3} = u_3^{-1} \mathcal{L}_{q_4}$, it suffices to prove that
$\mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_4}$. Let $w = u_1 u_2^M$ and $q_5 = \Delta(q_1, w^M)$. Both
$u_2$ and w belong to $Loop(q_5)$ because L is aperiodic. As
$\Delta(q_1, w^M) = q_5$ and $w \in Loop(q_2)$, we get $\mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_5}$
through Lemma 3 with $q_1$, $q_2$ and w. As $\Delta(q_4, u_2^M) = q_4$
and $u_2 \in Loop(q_5)$, one more application of Lemma 3
with $q_4$, $q_5$ and $u_2$ yields $\mathcal{L}_{q_5} \subseteq \mathcal{L}_{q_4}$. □

This implies that two distinct states $q_1$ and $q_2$ in the
same component cannot loop on the same word.

Lemma 7. Let L be a regular language in trC. Let
$q_1, q_2$ be two states belonging to the same component of $A_L$.
Then, $Loop(q_1) \cap Loop(q_2) \ne \emptyset$ implies $q_1 = q_2$.

Proof. Let $q_1, q_2$ be as above, and let w be a word in
$Loop(q_1) \cap Loop(q_2)$. According to Lemma 6, $w^M \mathcal{L}_{q_2} \subseteq
\mathcal{L}_{q_1}$. Hence $\mathcal{L}_{q_2} \subseteq \mathcal{L}_{q_1}$ since $w \in Loop(q_1)$. By symmetry,
$\mathcal{L}_{q_2} = \mathcal{L}_{q_1}$, which implies $q_2 = q_1$. □

Lemma 8. Let C be a component of $A_L$, $q_1, q_2 \in C$
and $a \in \Sigma$. Then $\Delta_L(q_1, a) \in C$ iff $\Delta_L(q_2, a) \in C$.

Proof. Let $q_1 \ne q_2$ be two states in the same component.
Assume by contradiction that $\Delta_L(q_1, a) \in C$
and $\Delta_L(q_2, a) \notin C$. Notice that $Loop(q_1)$ and $Loop(q_2)$
are not empty. Let $w \in Loop(q_1) \cap a\Sigma^*$. Let $q_3 =
\Delta_L(q_2, w^M)$. As L is aperiodic, $\Delta_L(q_3, w^M) = q_3$. Thus,
$w^M \in Loop(q_1) \cap Loop(q_3)$. Consequently, $w^M \mathcal{L}_{q_3} \subseteq
\mathcal{L}_{q_1}$ and $w^M \mathcal{L}_{q_1} \subseteq \mathcal{L}_{q_2}$. Thus $\mathcal{L}_{q_3} \subseteq \mathcal{L}_{q_1}$ and $\mathcal{L}_{q_1} \subseteq \mathcal{L}_{q_3}$.
Thus, $\mathcal{L}_{q_1} = \mathcal{L}_{q_3}$ and, by minimality of $A_L$, $q_1 = q_3$.
This is absurd because $q_1$ and $q_3$ are not in the same
component. □

Notation 1. We define the internal alphabet of a
component C of $A_L$ as the set $\Sigma_C = \{a \in \Sigma : \exists q_1, q_2 \in
C.\ \Delta_L(q_1, a) = q_2\}$.
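
Components and internal alphabets are easy to compute on a small complete DFA. The Python sketch below is an illustration only: the function names and the dictionary encoding of the transition function are assumptions, and the reachability computation is deliberately naive. It returns only the components that contain a cycle, i.e. those whose states q have a non-empty Loop(q).

```python
def components(states, alphabet, delta):
    """Strongly connected components of the DFA graph that contain a cycle."""
    reach = {q: set() for q in states}  # states reachable by a non-empty word
    for q in states:
        stack = [delta[(q, a)] for a in alphabet]
        while stack:
            r = stack.pop()
            if r not in reach[q]:
                reach[q].add(r)
                stack.extend(delta[(r, a)] for a in alphabet)
    comps = []
    for q in states:
        if q in reach[q]:  # q lies on a cycle, so Loop(q) is non-empty
            comp = frozenset(r for r in reach[q] if q in reach[r])
            if comp not in comps:
                comps.append(comp)
    return comps


def internal_alphabet(comp, alphabet, delta):
    """Letters labeling at least one transition that stays inside comp."""
    return {a for a in alphabet for q in comp if delta[(q, a)] in comp}
```

On the three-state complete DFA of a*bc* (state 2 being the sink), the components are {0}, {1} and {2}, with internal alphabets {a}, {c} and {a, b, c} respectively.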

As a direct consequence of Lemma 8 we get:

Lemma 9. Let C be a component of $A_L$, $q \in C$ and
$w \in \Sigma^*$. Then $\Delta(q, w) \in C$ iff $w \in (\Sigma_C)^*$.

Lemma 10. Let C be a component of $A_L$, $\Sigma_C$ be the
internal alphabet of C, $q_1, q_2$ be two states of C and
$w \in (\Sigma_C)^{M^2}$. Then, $\Delta_L(q_1, w) = \Delta_L(q_2, w)$.

Proof. Assume that $w = a_1 \ldots a_{M^2}$. For each $i =
0, \ldots, M^2$ and $\alpha = 1, 2$, let $q_{\alpha,i} = \Delta_L(q_\alpha, a_1 \ldots a_i)$.
Since there are at most $M^2$ distinct pairs $(q_{1,i}, q_{2,i})$,
there exist i, j, with i < j such that $q_{1,i} = q_{1,j}$ and $q_{2,i} =
q_{2,j}$. Let $w' = a_{i+1} \ldots a_j$. We have $w' \in Loop(q_{1,i}) \cap
Loop(q_{2,i})$. Thus, by Lemma 9, $q_{1,i}, q_{2,i} \in C$ and, by
Lemma 7, $q_{1,i} = q_{2,i}$ and $\Delta_L(q_1, w) = \Delta_L(q_2, w)$. □

Lemma 11. Let $q_1, q_2$ be two states such that $q_2 \in
\Delta_L(q_1, \Sigma^*)$, $Loop(q_1) \ne \emptyset$, and $Loop(q_2) \ne \emptyset$. Let C be
the component that contains $q_2$ and $\Sigma_C$ be the internal
alphabet of C. Then, $\mathcal{L}_{q_2} \cap (\Sigma_C)^N \Sigma^* \subseteq \mathcal{L}_{q_1}$.

Proof. Let $w \in \mathcal{L}_{q_2} \cap (\Sigma_C)^N \Sigma^*$. There are some
words $u, v \in (\Sigma_C)^{M^2}$, $w' \in \Sigma^*$ such that $w = uvw'$. By
Lemma 8 and the Pigeonhole Principle, there exist a
state $q_3$ and M + 1 non-empty words $v_1, \ldots, v_{M+1}$ such
that $v = v_1 \ldots v_{M+1}$ and $\Delta_L(q_2, u v_1 \ldots v_i) = q_3$ for every
$i \in [M]$. Therefore, $w \in u v_1 (Loop(q_3))^{M-1} v_{M+1} w'$.
By Lemma 10, $\Delta_L(q_3, u v_1) = \Delta_L(q_2, u v_1) = q_3$. Thus,
$w \in Loop(q_3)^M v_{M+1} w' \cap \mathcal{L}_{q_3}$. By Lemma 6, $w \in
\mathcal{L}_{q_1}$. □



3.2.2 Computing RSPQ(L) for L in trC

In the following, we describe a polynomial algorithm 
that computes RSPQ(L) when L belongs to trC. Observe
that if we are looking for a (not necessarily simple)
regular path, a dynamic programming approach can be 
used, essentially because only the last vertex in the (par- 
tial) path needs to be memorized in order to build the 
path incrementally. This approach is not adequate to 
build a simple path, as we need to memorize all the 
vertices in the path. We therefore need to consider an 
exponential number of paths. 

Nevertheless, we will show that in the case where 
L belongs to trC, we can identify a finite number of 
vertices that suffice to check if the path is (or can be 
transformed into) a simple path labeled with L. These 
" important" vertices shall be stored in a path summary, 
as presented in the following. Unlike paths, summaries 
can be enumerated in logarithmic space, and we shall 
explain how one can use the summaries to check if there 
exists a simple path between the input nodes. 

We first define the notion of L-annotation of a path p.

Definition 2. Let L be a language in trC, and let p
be a path $p = (v_1, a_1, \ldots, a_m, v_{m+1})$. The L-annotation
of p is the mapping $\rho : \{v_1, \ldots, v_{m+1}\} \to Q_L$ such that:
$\rho(v_1) = i_L$ and $\rho(v_{i+1}) = \Delta_L(i_L, a_1 \ldots a_i)$ for every
$i \in [m]$.
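
Definition 2 amounts to running the DFA along the path and recording the state reached at each vertex. A minimal Python sketch follows; the function name and the list encoding of paths are assumptions made for this illustration. It returns the states in path order rather than a mapping, which also remains meaningful for paths that are not simple.

```python
def annotate(path, delta, initial):
    """L-annotation of path = [v1, a1, v2, ..., am, v_{m+1}].

    Returns [rho(v1), ..., rho(v_{m+1})], where delta is a dict
    (state, letter) -> state describing a complete DFA for L.
    """
    states = [initial]
    for a in path[1::2]:  # the labels a1, ..., am
        states.append(delta[(states[-1], a)])
    return states
```

With the complete DFA of a*bc* (states 0, 1 and sink 2), the path x, a, u, b, v, c, y is annotated with the states 0, 0, 1, 1.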

We now introduce the concept of summary for a path
p (with annotation $\rho$). Roughly speaking, the idea is
to keep only a bounded number of vertices of p (a number that
depends only on L). Actually, the information we must
record for each component C of $A_L$ can be limited to
the first and the N last vertices having their state in C.
This allows us to apply Lemma 11. Additionally, if the
number of such vertices is greater than N + 1, we replace
the path between the first vertex and the N last ones by
a special symbol $\Sigma_C^*$, where $\Sigma_C$ is the internal alphabet
of C. It means that the path we have removed forms a
word that belongs to $\Sigma_C^*$. More formally, a summary is
defined as follows.

Definition 3. Let $p = (v_1, a_1, \ldots, a_m, v_{m+1})$ be a
path and let $\rho$ be the L-annotation of p. The summary
S of the path $(p, \rho)$ (w.r.t. $A_L$) is obtained from p (and
$\rho$) by the following replacements. Let $C_1, \ldots, C_l$ be the
components such that there are at least N + 1 vertices v
in p such that $\rho(v) \in C_i$ (the sequence is sorted in topological
order). For each component $i \in [l]$, let $\alpha_i$ and
$\beta_i$ denote the first and the maximal indices such that
$\rho(v_{\alpha_i}), \rho(v_{\beta_i}) \in C_i$. Let $\beta'_i = \beta_i - N$. We replace the
subpath $v_{\alpha_i} \ldots v_{\beta'_i}$ by $v_{\alpha_i}, \Sigma_{C_i}^*, v_{\beta'_i}$. We denote by lrc(p)
the set $\{C_1, \ldots, C_l\}$ of long run components.
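
The replacement step of Definition 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `comp_of` map from states to component identifiers, and the encoding of the special symbol as a pair `("STAR", c)` are hypothetical. The sketch follows the formula $\beta' = \beta - N$ literally and only performs a replacement when $\beta' > \alpha$, so that a single vertex is never replaced by itself.

```python
def summarize(path, states, comp_of, n):
    """Summary of path = [v1, a1, ..., am, v_{m+1}] with annotation `states`.

    states: annotation in path order, one state per vertex.
    comp_of: dict state -> component identifier, for states lying on a cycle.
    n: the constant N of the paper (any positive integer for experiments).
    """
    vertices, labels = path[0::2], path[1::2]
    positions = {}
    for i, q in enumerate(states):
        if q in comp_of:
            positions.setdefault(comp_of[q], []).append(i)
    summary, i = [vertices[0]], 0
    while i < len(labels):
        c = comp_of.get(states[i])
        pos = positions.get(c, [])
        if (c is not None and len(pos) >= n + 1
                and i == pos[0] and pos[-1] - n > i):
            j = pos[-1] - n  # index beta' = beta - N
            summary += [("STAR", c), vertices[j]]
            i = j
        else:
            summary += [labels[i], vertices[i + 1]]
            i += 1
    return summary
```

For example, with n = 1 and a path whose first four vertices are annotated in the same component, the subpath between the first of these vertices and the third one is replaced by the special symbol of that component, and the rest of the path is kept verbatim.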

Notice that a summary contains at most $NM = 2M^3$
elements (vertices and labels), which is constant if L is
fixed. Consequently, given a db-graph G, the number of
distinct summaries of L-labeled paths in G is bounded 
by a polynomial in \G\ (when L is fixed). Furthermore, 
each vertex of the graph can be represented with a loga- 
rithmic number of bits. A logarithmic number of bits is 
therefore sufficient to encode a summary (for fixed L). 

We define a candidate summary S as a sequence of
vertices and labels of the form above: $S = (v_1, \alpha_1, \ldots, \alpha_l,
v_{l+1})$ where $\alpha_i \in \Sigma \cup \{\Sigma_C^* \mid C \text{ a component of } A_L\}$ and $l \le NM$. A path p
obtained by replacing each subsequence $(v_{\alpha_i}, \Sigma_{C_i}^*, v_{\beta'_i})$
with a simple $\Sigma_{C_i}^*$-labeled path from $v_{\alpha_i}$ to $v_{\beta'_i}$ is called
a completion of the candidate summary S.

Example 2. Figure 2 represents the minimal DFA
for $L = a(c^{\ge 2} + \varepsilon)(a + b)^*(ac)?a^*$ (we did not represent
the sink state). This automaton can loop in three
strongly connected components: $C_1 = \{q_4\}$, $C_2 = \{q_5, q_6\}$,
and $C_3 = \{q_7\}$. The accepting states are $q_2, q_4, q_5, q_6$,
and $q_7$. In this automaton M = 7, so N should be huge,
but we shall pretend that N = 3 for our example as this
value is sufficient for our algorithm. Let us consider
the path $p_1$ illustrated in Figure 3 with thick edges. The
table below details this path and the corresponding annotation.
[Table: the vertices of $p_1$ and their annotating states, with the annotations in $C_1$ and the annotations in $C_2$ marked.]

We observe that $p_1$ is a simple L-labeled path. The
summary S of $p_1$ is obtained by removing the second
(resp. second and third) vertex with state annotated in
$C_1$ (resp. $C_2$): only the vertices highlighted in red in the
table are preserved, and the component for the vertices
eliminated is identified by special symbols $\Sigma_{C_1}^*$ and $\Sigma_{C_2}^*$.

$S = (v_1, a, v_2, c, v_3, c, v_4, \Sigma_{C_1}^*, v_7, c, v_8, c, v_9,
a, v_{10}, \Sigma_{C_2}^*, v_{13}, a, v_{14}, a, v_{15})$.



Figure 2: Minimal DFA for $a(c^{\geq 2}+\varepsilon)(a+b)^*(ac)^?a^*$.

Lemma 12. Let S be the summary of an L-labeled
path p and let p' be a completion of S. Then, p' is an
L-labeled path of summary S.




Figure 3: Nice simple path, and its summary. (Legend: elements of the summary; nodes in acc(1); nodes in acc(2).)

This gives an NL algorithm to test if a given candidate
summary S is the summary of an L-labeled path:
we only need to compute a completion and then test
if this completion is an L-labeled path. However, the
completed path is not necessarily simple, even if S is
the summary of a simple path. Indeed, the paths we
have built between each $v_{\alpha_i}$ and $v_{\beta'_i}$ are not necessarily
disjoint. To overcome this problem, we will restrict the
set of potential paths and we will only search for a certain
type of path that we call a nice path.

Definition 4. Let p be an L-labeled path of run $\rho$
and summary S. Let F be the set of vertices appearing
in S. Let $(C_i, \alpha_i, \beta_i)_{i \in [l]}$ be as stated in Definition 3.

We define $P_i$, $length_i$ and $acc(i)$ as follows for every $i \in [l]$
in increasing order: $P_i$ is the set of paths p' starting
from $v_{\alpha_i}$ and satisfying the following three conditions:

1. p' is a simple $\Sigma_{C_i}^*$-labeled path;

2. there is no vertex in p' that belongs to $F \setminus \{v_{\alpha_i}, v_{\beta'_i}\}$;

3. furthermore, there is no vertex in p' that belongs
to $acc(j)$ for any $j \in [i-1]$.

We define $length_i$ as the length of the shortest path from
$v_{\alpha_i}$ to $v_{\beta'_i}$ that belongs to $P_i$. And $acc(i)$ is the set of
vertices v reachable from $v_{\alpha_i}$ by a path $p' \in P_i$ of size
$w(p') < length_i$.

We qualify p as nice if the following three conditions
are satisfied: (a) all vertices appearing in S are distinct,
(b) the subpath $p_i$ of p from $v_{\alpha_i+1}$ to $v_{\beta'_i}$ belongs to
$P_i$, and (c) $w(p_i) = length_i$. Consequently all vertices of
$p_i$ belong to $acc(i)$.
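Since a shortest walk is always simple, $length_i$ and $acc(i)$ can be obtained by a breadth-first search restricted to the edges labeled in the internal alphabet of the component. The Python sketch below illustrates this; it is not the logarithmic-space procedure of the paper (it stores distances explicitly), and the function name `length_and_acc`, the edge encoding, and the choice to exclude the start vertex from the returned set (which matches the sets listed in Example 3) are assumptions.

```python
from collections import deque


def length_and_acc(edges, start, target, allowed_labels, forbidden):
    """Return (length, acc) for one long run component.

    edges: iterable of (u, label, v) triples.
    allowed_labels: internal alphabet of the component.
    forbidden: vertices that may not be used, i.e. the other summary
               vertices and the sets acc(j) of earlier components.
    """
    succ = {}
    for u, a, v in edges:
        if a in allowed_labels:
            succ.setdefault(u, []).append(v)

    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in succ.get(u, []):
            if v not in dist and v not in forbidden:
                dist[v] = dist[u] + 1
                queue.append(v)

    if target not in dist:
        return None, set()
    length = dist[target]
    # vertices strictly closer than the target; the start vertex is left out
    acc = {v for v, d in dist.items() if 0 < d < length}
    return length, acc


edges = [("v4", "c", "v5"), ("v4", "c", "v6"),
         ("v5", "c", "v7"), ("v6", "c", "v7"), ("v4", "a", "v7")]
print(length_and_acc(edges, "v4", "v7", {"c"}, set()))
```

Processing the components in increasing order and passing the union of the earlier `acc` sets as `forbidden` reproduces condition 3 of the definition.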

Example 3. The path $p_1$ defined in Example 2 is
nice, since $acc(1) = \{v_5, v_6\}$ and $acc(2) = \{v_{11}, v_{12}\}$,
as illustrated in Figure 3. Observe that neither $v_5$ nor
$v_6$ is necessary to fill the gap from $v_4$ to $v_7$ in the
summary, yet one of them is necessarily traversed. The
definition of nice paths guarantees that the paths $p'_1$ and $p'_2$
replacing $\Sigma_{C_1}^*$ and $\Sigma_{C_2}^*$ are disjoint.
One might think the intersection of the two paths $p'_1$
and $p'_2$ does not matter, as each loop can be eliminated
while preserving the membership of the label in L,
according to Lemma 11. In this example this is indeed
the case: the path through $v_1, v_2, v_3, v_4, v_6, v_{13}, v_{14}$, and
$v_{15}$ belongs to the language L. But Example 4 shows
this need not be the case in general.

Example 4. The language $L = a^*(bb^+ + \varepsilon)c^*$ from
Example 1 shows why one cannot simply eliminate loops
by repeated applications of Lemma 11. The minimal
DFA for L has 4 states so N should be 2 * 16 but here
we could assume N = 2. The L-labeled path illustrated
in Figure 4 consists of an $a^{2N}$-labeled path $p_x$ from $x_0$ to $x_{2k}$,
a $c^{2N}$-labeled path $p_y$ from $y_0$ to $y_{2k}$, plus a $b^{2N}$-labeled
path from $x_{2k}$ to $y_0$, intersecting $p_x$ and $p_y$ in their
middle after N and N + 1 letters, respectively. This path is
not simple as it intersects twice with itself. Lemma 11
can be applied once to remove any of the two loops but
then the remaining loop cannot be removed while
preserving the language. The notion of nice path has been
introduced to tackle exactly this difficulty.



Figure 4: Counterexample to loop-elimination. (Recoverable labels: vertices $x_k$, $x_{2k}$, $y_0$, $y_k$, $y_{2k}$; edge labels a and c; k > N.)

Lemma 13. Let p be an L-labeled path from x to y. 
If p is nice, then it is a simple L-labeled path. 

Proof. This is because for any nice path the nodes
of its summary are distinct, the path inside a component
is simple due to condition 1, and the sets acc(i) are
pairwise disjoint due to condition 3. □

We observe that, although condition 1 requires that
the paths inside a component are simple, we can easily
obtain a nice path from any p that violates no condition
but 1 by removing the loops inside a component.
The whole point of the definition is to make sure that
there are no loops in the path except inside a connected
component.

The following lemma explains that finding a simple 
path is equivalent to finding a nice path. 

Lemma 14. Let (G,x,y) be an instance of RSPQ(L).
Then, every shortest simple L-labeled path from x to
y is nice.

Proof. Let $p = (v_1, a_1, \ldots, a_m, v_{m+1})$ be a shortest
simple L-labeled path from x to y. We use the
notations $(C_i, \alpha_i, \beta_i)_{i \in [l]}$ as stated in Definition 3 for the
long run components in the summary S of p and their
corresponding indices.

Assume for the sake of contradiction that p is not
nice. There exist $i \in [l]$, $j \in [l]$, $k \in [m]$ such that
$i < j$, $\alpha_j < k < \beta'_j$ and $v_k \in acc(i)$. We choose i as
minimal and then k as maximal for this property. Let
$p_1$ be the subpath of p from $v_{\alpha_i}$ to $v_k$. By definition
there is a $\Sigma_{C_i}^*$-labeled simple path $p_2$ from $v_{\alpha_i}$ to $v_k$
whose vertices belong to acc(i). We define p' from p
by replacing the subpath $p_1$ by $p_2$. Let $\rho'$ be the
L-annotation of p'. Notice that $\rho(v_k) \in C_j$ and $\rho'(v_k) \in C_i$.
Furthermore $a_k \ldots a_m \in \Sigma_{C_j}^N \cdot \Sigma^* \cap \mathcal{L}_{\rho(v_k)}$. Thus, by
Lemma 11, $a_k \ldots a_m \in \mathcal{L}_{\rho'(v_k)}$. Consequently, p' is an
L-labeled path.

We now prove that p' is simple. Since p is simple,
it suffices to prove that the vertices in $p_2$ are disjoint
from the other vertices in p'. By minimality of i, the sets
$acc(1), \ldots, acc(i)$ are pairwise disjoint. By maximality
of k, for every k' > k, $v_{k'} \notin acc(i)$.

We finally observe that p' is strictly shorter than p
since $w(p_2) < w(p_1)$. This contradicts the
minimality of w(p). □

We next show how a nice summary can be completed 
in logarithmic space into a simple path. 

Lemma 15. Let L be a fixed language in trC. There 
exists a non deterministic log-space algorithm that given 
an instance (G,x,y) of RSPQ(L) and a candidate sum- 
mary S 

1. returns "Yes" if there is a shortest simple L-labeled 
path of summary S from x to y; 

2. returns "No" if there is no simple L-labeled path 
from x to y; 

3. is unspecified otherwise. 

Proof. It suffices to complete the summary S into
a nice path p. If S is a summary of a nice L-labeled
path then p is an L-labeled path by Lemma 12 and is
simple by Lemma 13. If no such path exists, the
algorithm returns "No". This can be done in logarithmic
space by directly applying the definition. To this
purpose, we need to compute the sets acc(i). These
sets can be described in FO + TC, i.e. FO plus
the transitive closure operator. Since the evaluation of
a fixed FO + TC formula is in NL [20], the sets acc(i)
can be computed using logarithmic space. Notice that
the sets acc(i) are not stored in memory but recomputed
each time we need to access them. The same remark
applies to the path p. The candidate summary S may
not be a summary of a simple L-labeled path. Thus, we
must check that the obtained path is a simple L-labeled
path. Notice that the algorithm possibly returns "Yes"
even when S is not a summary of a shortest path. □

We eventually show the main lemma of this section.

Lemma 16. Let $L \in trC$. Then, $RSPQ(L) \in NL$.

Proof. We simply enumerate all possible candidate
summaries S w.r.t. (L, G, x, y), and apply to each
summary the algorithm of Lemma 15. We return "Yes" if
this algorithm returns "Yes" for at least one candidate
summary S. Otherwise, we return "No". Therefore, our
algorithm returns "Yes" if and only if there exists a nice
path from x to y, and consequently, if and only if there
is a simple L-labeled path from x to y. Since L is fixed, there is
a polynomial number of candidate summaries, each of
logarithmic size. Consequently, they can be enumerated
within logarithmic space. □
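For testing purposes it can help to have a reference evaluator of the problem itself. The Python sketch below is a plain exponential-time backtracking search over simple paths, driven by a DFA for L; it is not the NL algorithm of the proof, and the function name `rspq_bruteforce` and the dictionary encoding of the DFA are assumptions.

```python
def rspq_bruteforce(edges, delta, initial, finals, x, y):
    """Decide whether some simple path from x to y is labeled by a word of L.

    edges: iterable of (u, label, v); delta: dict (state, label) -> state,
    where a missing entry stands for the sink state.
    """
    succ = {}
    for u, a, v in edges:
        succ.setdefault(u, []).append((a, v))

    def dfs(v, q, visited):
        if v == y and q in finals:
            return True
        for a, v2 in succ.get(v, []):
            q2 = delta.get((q, a))
            if q2 is not None and v2 not in visited:
                if dfs(v2, q2, visited | {v2}):
                    return True
        return False

    return dfs(x, initial, frozenset([x]))


# L = (aa)*: the walk x, y, y has an even length but is not simple.
even_a = {(0, "a"): 1, (1, "a"): 0}
print(rspq_bruteforce([("x", "a", "y"), ("y", "a", "y")], even_a, 0, {0}, "x", "y"))
print(rspq_bruteforce([("x", "a", "u"), ("u", "a", "y")], even_a, 0, {0}, "x", "y"))
```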

Notice that we can easily adapt our algorithm such
that it outputs a shortest path for positive instances. It
can be generalized to db-graphs weighted by a function
$E \to \mathbb{R}^+$.

We can now state the main theorem. 

Theorem 1. Let L be a regular language. Then,
RSPQ(L) is in NL if $L \in trC$ and is NP-complete
otherwise.

3.3 Towards a complete classification 

Actually, the classification can be made more precise. 
We have divided the RSPQ(L) problems into NL and 
NP-complete problems. Now, we can envisage a classi- 
fication within the class of NL problems. 

Lemma 17. For every regular language L, $RSPQ(L) \in AC^0$
if L is finite and is NL-hard otherwise.

The proof is based on a reduction from the following
NL-complete problem [33].

Reachability
Input: A directed graph G and two vertices x, y in G
Question: is there a path from x to y?



Proof. (Easiness) Given an alphabet $\Sigma$, we consider
the signature $\tau = (R_a)_{a \in \Sigma}$ of binary predicates. We
can view a db-graph $(V, \Sigma, E)$ as a $\tau$-structure $\mathcal{M} =
(V, (R_a)_{a \in \Sigma})$ of domain V and such that $(v_1, v_2) \in R_a$
iff $(v_1, a, v_2) \in E$ for every $v_1, v_2 \in V$ and $a \in \Sigma$. Let
$w = a_1 \ldots a_k$ be a word. We let the reader verify that
the predicate $path_w(x, y)$ (which means there is a simple
w-labeled path between x and y) is expressible in FO.
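One way to write such a formula, for a word $w = a_1 \ldots a_k$, is sketched below; the variable names are chosen for illustration. For a finite language L, the query is then the finite disjunction of $path_w(x, y)$ over all $w \in L$.

```latex
\[
path_w(x, y) \;=\; \exists v_1 \ldots \exists v_{k+1}\;
  \Big( v_1 = x \;\wedge\; v_{k+1} = y
  \;\wedge\; \bigwedge_{i=1}^{k} R_{a_i}(v_i, v_{i+1})
  \;\wedge\; \bigwedge_{1 \le i < j \le k+1} v_i \neq v_j \Big)
\]
```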

(Hardness) We make a reduction from Reachability.
Let L be an infinite regular language. By the Pumping
Lemma, there exist non-empty words u, v, w such
that $uv^*w \subseteq L$. We build a db-graph G'
from G by labeling each edge of G with v. We add two
vertices x' and y' and two edges (x', u, x) and (y, w, y').
There is a (simple) path from x to y in G iff there is
an L-labeled simple path from x' to y' in G'. Consequently,
RSPQ(L) is NL-hard. □
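The reduction can be sketched in Python as follows. Since db-graph edges carry single letters, the sketch realizes an edge labeled by a word as a chain of fresh intermediate vertices; this subdivision, the function name `reduce_reachability` and the naming of the fresh vertices are assumptions made for the illustration.

```python
def reduce_reachability(edges, x, y, u, v, w):
    """Build the labeled graph G' from a digraph G (list of (s, t) pairs).

    u, v, w are non-empty words with u v* w included in L.
    Returns (labeled_edges, x_new, y_new).
    """
    labeled = []
    fresh = 0

    def add_word_path(s, word, t):
        # one edge per letter, with fresh vertices between consecutive letters
        nonlocal fresh
        cur = s
        for idx, letter in enumerate(word):
            if idx == len(word) - 1:
                nxt = t
            else:
                nxt = ("mid", fresh)
                fresh += 1
            labeled.append((cur, letter, nxt))
            cur = nxt

    for s, t in edges:
        add_word_path(s, v, t)
    x_new, y_new = ("new", "x'"), ("new", "y'")
    add_word_path(x_new, u, x)
    add_word_path(y, w, y_new)
    return labeled, x_new, y_new


g_edges, xs, ys = reduce_reachability([(1, 2), (2, 3)], 1, 3, "a", "bc", "d")
for e in g_edges:
    print(e)
```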

Our results so far can now be summarized by the
following trichotomy, which refines Theorem 1.

Theorem 2. Let L be a regular language. One of
these statements holds.

1. L is finite: $RSPQ(L) \in AC^0$;

2. $L \in trC$ and L is infinite: RSPQ(L) is NL-complete;

3. $L \notin trC$: RSPQ(L) is NP-complete.

3.4 Recognition of tractable languages 

This section investigates the complexity of deciding if 
RSPQ(L) is tractable (i.e. if RSPQ(L) can be computed 
in polynomial time). We consider different representa- 
tions of L (DFAs, NFAs or regular expressions). 

Theorem 3. Testing whether a regular language L
belongs to trC is:

1. NL-complete if L is given by a DFA;

2. PSPACE-complete if L is given by an NFA (resp.
a regular expression).

The proofs of hardness for DFA and NFA rely on
reductions from the following two problems respectively.

Emptiness
Input: A DFA $A_L = (Q_L, \Sigma, i_L, F_L, \Delta_L)$ that recognizes a language L
Question: is $L = \emptyset$?

and

Universality
Input: An NFA (or a regular expression) that recognizes a language $L \subseteq \{0, 1\}^*$
Question: $L = \{0, 1\}^*$?

The NL-completeness of Emptiness can easily be deduced
from the NL-completeness of Reachability [33].
Stockmeyer and Meyer [38] prove that Universality is
PSPACE-complete.

3.5 Characterization by regular expressions 

In this section, we propose a characterization of trC
languages in terms of regular expressions: the languages
in trC are exactly those that can be expressed with an
expression in the fragment $\Psi_{tr}$ defined below. This
fragment essentially enforces restrictions on the
concatenation of subexpressions: except at the highest level, only
expressions of the form $e + \varepsilon$ can be concatenated.

$\Psi_{tr}$-terms are defined as follows:

$\Psi_{tr}$-term ::= $w + \varepsilon$ | $A^{\geq k} + \varepsilon$

where w is a word, A is a subset of $\Sigma$, and $A^{\geq k}$ is a
shortcut for $A^k A^*$. A $\Psi_{tr}$-sequence is a concatenation
of terms $w \varphi_1 \ldots \varphi_l w'$ where w and w' are words and
$\varphi_1, \ldots, \varphi_l$ are $\Psi_{tr}$-terms. Finally, the fragment $\Psi_{tr}$ is the
set of disjunctions of $\Psi_{tr}$-sequences.
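To make the shape of the fragment concrete, the following Python sketch translates terms and sequences into patterns for the standard `re` module. The helper names (`word_term`, `alpha_term`, `sequence`, `compile_psi_tr`) are hypothetical, and letters are assumed to be single characters. The usage part encodes the expression of Example 2, which is syntactically a single sequence of this fragment.

```python
import re


def word_term(w):
    """The term  w + epsilon."""
    return "(?:" + re.escape(w) + ")?"


def alpha_term(letters, k):
    """The term  A^{>=k} + epsilon, for a set A of single-character letters."""
    cls = "[" + "".join(re.escape(c) for c in sorted(letters)) + "]"
    return "(?:" + cls + "{" + str(k) + ",})?"


def sequence(prefix, terms, suffix):
    """A sequence  w t_1 ... t_l w'  with words w, w' and terms t_i."""
    return re.escape(prefix) + "".join(terms) + re.escape(suffix)


def compile_psi_tr(sequences):
    """A disjunction of sequences."""
    return re.compile("|".join("(?:" + s + ")" for s in sequences))


# a (c^{>=2} + eps) (a+b)* (ac)? a*   written with terms of the fragment
example2 = compile_psi_tr([
    sequence("a", [alpha_term("c", 2), alpha_term("ab", 0),
                   word_term("ac"), alpha_term("a", 0)], "")
])
for word in ["a", "acc", "aac", "accabaca", "ac", "acb"]:
    print(word, bool(example2.fullmatch(word)))
```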

Theorem 4. A language L belongs to trC iff L is
recognized by a regular expression in $\Psi_{tr}$.

Proof. We leave the details for the Appendix. Proving
that expressions in $\Psi_{tr}$ define tractable languages is
relatively straightforward. Lemmas 10 and 11 are the
cornerstones that allow us to construct the expressions for
the other direction. □

We observe that adapting the notion of summary allows
for a proof of Lemma 16 that directly considers regular
expressions in $\Psi_{tr}$. Since trC is closed by union, we
can restrict ourselves to $\Psi_{tr}$-sequences $\varphi_1 \ldots \varphi_l$ where
$\varphi_1$, $\varphi_l$ are words and $\varphi_2, \ldots, \varphi_{l-1}$ are $\Psi_{tr}$-terms. Let p
be an L-path. We decompose p into subpaths $p_1, \ldots, p_l$
such that $p_i$ matches the expression $\varphi_i$ for every $i \in [l]$.
The summary of p is built as follows:

• if $\varphi_i$ is a word or an expression of the form $w + \varepsilon$,
we keep all vertices of $p_i$ in the summary;

• if $\varphi_i$ is an expression of the form $A^{\geq k} + \varepsilon$, we keep
the k first and k last vertices and replace the rest
of the path by the symbol $A^*$.

4. OTHER RESULTS 

This section investigates three further issues. First,
we consider RSPQs over vertex-labeled graphs. Then,
we give minor results on the parametrized complexity of
the RSPQ problem. Finally, we discuss the complexity
of RSPQs over graphs of bounded directed treewidth.
These are straightforward applications of standard
techniques, yet the results may be of practical interest.

4.1 Other models of database graphs 

The goal of this section is to adapt our classification 
to two other models of graphs: vl-graphs i.e. graphs 
whose vertices are labeled, and evl-graphs where both 
vertices and edges are labeled. We denote by vlg the 
class of vl-graphs and evlg the class of evl-graphs. 

For simplicity, we will consider vl-graphs and evl-graphs
as special db-graphs. This will let us work with
a unique model and unique definitions. For vl-graphs, we can put
the label of a vertex into edges. Consequently, we see
a vl-graph as a db-graph that respects the following
restriction: there exists no pair of edges e = (x, a, y) and
e' = (x', a', y) such that $a \neq a'$. Similarly, an evl-graph
can be seen as a db-graph over an alphabet $\Sigma_V \times \Sigma_E$
where $\Sigma_V$ is the set of vertex labels and $\Sigma_E$ is the set
of edge labels.
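A minimal Python sketch of this encoding, assuming (as the restriction above suggests) that the label of a vertex is copied onto its incoming edges; the function names `vl_to_db` and `is_vl_like` are hypothetical.

```python
def vl_to_db(vertex_label, edges):
    """Turn a vertex-labeled graph into a db-graph.

    vertex_label: dict vertex -> label; edges: iterable of (u, v) pairs.
    Each edge receives the label of its target vertex.
    """
    return [(u, vertex_label[v], v) for (u, v) in edges]


def is_vl_like(db_edges):
    """Check the restriction: all edges entering a vertex carry the same label."""
    seen = {}
    for _, a, v in db_edges:
        if seen.setdefault(v, a) != a:
            return False
    return True


db = vl_to_db({"x": "a", "y": "b", "z": "b"}, [("x", "y"), ("z", "y"), ("y", "z")])
print(db, is_vl_like(db))
```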

Clearly, given a language L, RSPQ(L, vlg) is at most
as difficult as RSPQ(L). However, for some languages,
the problem is easier. For example, for $L = a^*bc^*$,
$RSPQ(L, vlg) \in$ PTIME while RSPQ(L) is NP-complete.
The key is that a vertex cannot have two different labels
and, consequently, a path that matches $a^*$ is always
disjoint from a path that matches $c^*$. By contrast,
for $L = a^*ba^*$ or $L = (aa)^*$, the problem remains
NP-complete.

By generalizing this, we can define the class $trC_{vlg}$
that is the equivalent of trC for $RSPQ_{vlg}$. The idea
is that we can restrict the definition to consider only
words $w_1$ and $w_2$ whose last letter is identical.

Notation 2. We define the relation $=_{vl}$ as follows:
$w_1 =_{vl} w_2$ if there exists a label $a \in \Sigma$ such that $w_1, w_2 \in
\Sigma^* a$. For every label $a \in \Sigma$ and state $q \in Q_L$, we define
$Loop_a(q) = Loop(q) \cap \Sigma^* a$.

Definition 5. For each $i \ge 0$, we define $trC_{vlg}(i)$
as the class of regular languages L that satisfy the
following condition for all words $w_1, w_2, w_l, w_m, w_r \in
\Sigma^*$ such that $w_1 =_{vl} w_2$: if $w_l w_1^i w_m w_2^i w_r \in L$ then
$w_l w_1^i w_2^i w_r \in L$. We define the class $trC_{vlg}$ as the union
$\bigcup_{i \ge 0} trC_{vlg}(i)$.

As for db-graphs, we obtain the following dichotomy. 

Theorem 5. Let L be a regular language. Then,
$RSPQ_{vlg}(L)$ is in NL if $L \in trC_{vlg}$ and is NP-complete
otherwise.
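The condition on words with the same last letter can be explored on small cases with a bounded search. The Python sketch below looks for a violation of the condition "if $w_l w_1^i w_m w_2^i w_r \in L$ then $w_l w_1^i w_2^i w_r \in L$" among words of bounded length, with L given as a pattern for the `re` module. It can only exhibit counterexamples for the chosen i and bound, never prove membership in the class; the function names are hypothetical.

```python
import re
from itertools import product


def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def vl_counterexample(pattern, alphabet, i, max_len):
    """Search a violation of the condition with w1, w2 ending in the same letter."""
    lang = re.compile(pattern)
    ws = list(words(alphabet, max_len))
    nonempty = [w for w in ws if w]
    for w1 in nonempty:
        for w2 in nonempty:
            if w1[-1] != w2[-1]:
                continue
            for wl in ws:
                for wm in ws:
                    for wr in ws:
                        long_word = wl + w1 * i + wm + w2 * i + wr
                        short_word = wl + w1 * i + w2 * i + wr
                        if lang.fullmatch(long_word) and not lang.fullmatch(short_word):
                            return (wl, w1, wm, w2, wr)
    return None


print(vl_counterexample("a*bc*", "abc", 2, 2))
print(vl_counterexample("a*ba*", "ab", 2, 1))
```

For $a^*bc^*$ the search finds nothing, in line with the tractability claim above, while $a^*ba^*$ and $(aa)^*$ yield a violation immediately.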

Only minor changes are required from the approach
for db-graphs. Proofs are not provided here but will be
given in an extended version of this paper. The three
main differences are the following:

• in every proof where words $w_1$ and $w_2$ appear, we
consider that $w_1 =_{vl} w_2$;

• instead of considering two states $q_1$ and $q_2$ such
that $Loop(q_1) \neq \emptyset$ and $Loop(q_2) \neq \emptyset$, we consider
two states $q_1$ and $q_2$ and a label a such that
$Loop_a(q_1) \neq \emptyset$ and $Loop_a(q_2) \neq \emptyset$;

• the special symbol between $v_{\alpha_i}$ and $v_{\beta'_i}$ in summaries
is no longer of the form $\Sigma_{C_i}^*$, but $\lambda(v_{\alpha_i})^{-1} L_{C_i}$
where $L_{C_i}$ is the internal language of $C_i$. Similarly,
the paths of $P_i$ in the definition of a nice path must
be $\lambda(v_{\alpha_i})^{-1} L_{C_i}$-labeled paths.

We can obtain a similar result on evl-graphs. We
define a relation $=_{evl}$ over words of the alphabet $\Sigma =
\Sigma_v \times \Sigma_e$: $w_1 =_{evl} w_2$ if there exist $(a_v, a_e), (a_v, a'_e) \in \Sigma$
such that $w_1 \in \Sigma^*(a_v, a_e)$ and $w_2 \in \Sigma^*(a_v, a'_e)$.

Definition 6. For each $i \ge 0$, we define $trC_{evlg}(i)$
as the class of regular languages L that satisfy the
following condition: for all words $w_1, w_2, w_l, w_m, w_r \in
\Sigma^*$ such that $w_1 =_{evl} w_2$ and $w_l w_1^i w_m w_2^i w_r \in L$, it
holds that $w_l w_1^i w_2^i w_r \in L$. We define the class $trC_{evlg}$ as
the union $\bigcup_{i \ge 0} trC_{evlg}(i)$.

Theorem 6. Let L be a regular language. Then,
RSPQ(L, evlg) is in NL if $L \in trC_{evlg}$, and is
NP-complete otherwise.

4.2 Parametrized complexity 

This section focuses on the parametrized complexity
of the RSPQ problem.

para-RSPQ($\mathcal{C}$)
Input: a db-graph $G = (V, \Sigma, E)$,
a regular language $L \in \mathcal{C}$ given by an NFA $A_L = (Q_L, i_L, F_L, \Delta_L)$
Parameter: $|Q_L|$
Question: Is there a simple L-path of size at most k in G?



Our initial goal was to determine the parametrized
complexity of para-RSPQ(trC). Unfortunately, we could
only partially reach this goal. We first address the
parametrized complexity of RSPQs when the parameter
is the size of the path.

k-RSPQ
Input: a db-graph $G = (V, \Sigma, E)$,
a regular language L given by an NFA $A_L = (Q_L, i_L, F_L, \Delta_L)$,
two vertices x and y, an integer k > 0
Parameter: k
Question: Is there a simple L-labeled path of size at
most k from x to y in G?



Theorem 7. k-RSPQ is FPT. More precisely, the
problem is solvable in time $O(2^{O(k)} \cdot |A_L| \cdot |G| \cdot \log |G|)$.

The proof is based on the Color Coding method [2].
As a consequence of this theorem we get:

Corollary 1. Let $\mathcal{C}$ be the class of finite languages.
Then para-RSPQ($\mathcal{C}$) $\in$ FPT.

The finite language can be given by an acyclic NFA
or a star-free regular expression.

4.3 Directed treewidth 



Directed treewidth is a notion introduced in [23]. It
generalizes many other measures such as treewidth,
dag-width or Kelly-width [7, 19]. Directed treewidth
measures in some sense how close a digraph is to a DAG.
Johnson et al. [23] present a general method to design
polynomial algorithms on graphs of bounded directed
treewidth. Like most algorithms exploiting treewidth,
this method leverages a dynamic programming approach
on the decomposition tree. They apply this method to
show that testing the existence of a Hamiltonian path
is polynomial on such classes of graphs. Here, we
extend this result to show that the regular simple path
problem is also computable in polynomial time for the
same classes.



It has been observed in the literature that RSPQ
has polynomial combined complexity on two interesting
classes of graphs: graphs of bounded treewidth [6], and
DAGs [29]. The result for DAGs is indeed immediate,
as every path in a DAG is simple. The next theorem
generalizes both results.

Theorem 8. Let k > 0 and $\mathcal{G}$ be a class of db-graphs
with directed treewidth at most k. Then, RSPQ(Reg, $\mathcal{G}$)
is polynomial, where Reg denotes the regular languages.

5. CONCLUSION 

We now pinpoint some directions for future work.

• As an extension of our work, we can consider
context-free languages. It seems to be difficult.

• We have studied the regular simple path problem
from the data complexity perspective. An interesting
continuation of our work is to include the
language in the input (combined complexity). The
question is to decide, given a class of languages $\mathcal{C}$,
whether RSPQ($\mathcal{C}$) is in P or NP-complete. Mendelzon
and Wood prove that RSPQ($\mathcal{C}$) is in P for the
class $\mathcal{C}$ of languages closed by subwords, which
actually corresponds to trC(0). By Theorem 1,
$\mathcal{C} \subseteq trC$ is a necessary condition to get polynomial
combined complexity. It is not a sufficient condition
because the problem can be NP-complete
even for a class of finite languages. We conjecture
that a sufficient condition is that there exists $i \ge 0$
such that $\mathcal{C} \subseteq trC(i)$. It is not clear whether this
condition is necessary.

• What becomes tractable under restrictions to the
graph such as planar digraphs or undirected graphs?
Notice that both disjoint paths and even path problems
are polynomial in these cases [25, 30, 34, 36].

• From the parametrized complexity perspective, what
is the complexity of para-RSPQ(trC)? We conjecture
that it is in FPT.

Theorem 3. Testing whether a regular language L
belongs to trC is:

1. NL-complete if L is given by a DFA;

2. PSPACE-complete if L is given by an NFA (resp.
a regular expression).

Proof. 1) Easiness: The proof is based on the
characterisation of Lemma 6. Let $A_L = (Q_L, \Sigma, i_L, F_L, \Delta_L)$
be a DFA that recognizes L. We first observe that there
is an NL-reduction from this problem to the case where
$A_L$ is minimal. The idea is that checking whether two
states $q_1$ and $q_2$ are Nerode-equivalent, i.e. $\mathcal{L}_{q_1} = \mathcal{L}_{q_2}$,
is in NL. For this, we can build on-the-fly the automaton
for the language $(\mathcal{L}_{q_1} \setminus \mathcal{L}_{q_2}) \cup (\mathcal{L}_{q_2} \setminus \mathcal{L}_{q_1})$ to check its
emptiness.
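The equivalence test amounts to exploring the product of the automaton with itself and looking for a reachable pair of states that disagree on acceptance. The Python sketch below does this with an explicit breadth-first search; unlike the NL procedure it stores the visited pairs, and the function name `nerode_equivalent` and the dictionary encoding of a complete DFA are assumptions.

```python
from collections import deque


def nerode_equivalent(delta, finals, alphabet, q1, q2):
    """Return True iff q1 and q2 accept the same language.

    delta: dict (state, letter) -> state of a complete DFA.
    """
    seen = {(q1, q2)}
    queue = deque([(q1, q2)])
    while queue:
        p, q = queue.popleft()
        if (p in finals) != (q in finals):
            return False  # some word separates q1 from q2
        for a in alphabet:
            nxt = (delta[(p, a)], delta[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


two_state_loop = {(0, "a"): 1, (1, "a"): 0}
print(nerode_equivalent(two_state_loop, {0, 1}, "a", 0, 1))
print(nerode_equivalent(two_state_loop, {0}, "a", 0, 1))
```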

We assume now that $A_L$ is minimal. We only need
to test for each pair of states $q_1$ and $q_2$ whether $q_2 \in
\Delta_L(q_1, \Sigma^*)$, $Loop(q_1) \neq \emptyset$ and $Loop(q_2)^M \mathcal{L}_{q_2} \setminus \mathcal{L}_{q_1} =
\emptyset$. The first and second statements are easily verified
using an NL algorithm for transitive closure. For
the third statement, we build a DFA that recognizes
$Loop(q_2)^M \mathcal{L}_{q_2} \setminus \mathcal{L}_{q_1}$. A DFA for $Loop(q_2)^M \mathcal{L}_{q_2}$ can
be built as follows: we make M copies $(A_1, \ldots, A_M)$ of
$A_L$. For each copy $A_i$, $i \in [M]$, and each transition
from a state q' to $q_2$ inside $A_i$, we replace that transition
by a transition from the state q' of $A_i$ to the state
$q_2$ of $A_{i+1}$. In the automaton obtained, we choose the
state $q_2$ of $A_1$ as the initial state and the final states
of $A_M$ as final states. It can be easily checked that the
construction can be done in logspace, and similarly for
the DFA that recognizes $Loop(q_2)^M \mathcal{L}_{q_2} \setminus \mathcal{L}_{q_1}$. As before,
the emptiness of this automaton can be checked
using an NL algorithm by using transitive closure.

Hardness: we show a reduction from the Emptiness
problem. Let $L \subseteq \Sigma^*$ be an instance of Emptiness,
given by a DFA $A_L = (Q_L, i_L, F_L, \Delta_L)$. W.l.o.g., we
assume that $\varepsilon \notin L$ since this can be checked in constant
time. Furthermore, we assume that the symbol 1 does
not belong to $\Sigma$. Let $L' = 1^+ L 1^+$. A DFA $A_{L'}$ that
recognizes L' can be obtained from $A_L$ as follows. We
add a state $q_I$ that will be the initial state of $A_{L'}$ and a
state $q_F$ that will be the unique final state of $A_{L'}$. $\Delta_{L'}$
is the extension of $\Delta_L$ defined as follows:

• $\Delta_{L'}(q_I, 1) = q_I$ and $\Delta_{L'}(q_I, a) = i_L$ for every
symbol $a \in \Sigma$.

• For every final state $q \in F_L$, $\Delta_{L'}(q, 1) = q_F$.

• $\Delta_{L'}(q_F, 1) = q_F$.

If L is empty then $L' = \emptyset$ belongs to trC. Assume
that L is not empty. Let $w \in L$. Then, for every M,
$1^M w 1^M \in L'$ and $1^M 1^M \notin L'$. Thus $L' \notin trC$.

2) Easiness: we first observe the following fact: let
A, B be two problems such that $A \in NL$ and let t
be a reduction from B to A that works in polynomial
space and produces an exponential output. Then B
belongs to PSPACE. We can apply this technique here
since we can obtain a reduction of that kind from an
instance given by an NFA or a regular expression to an
equivalent instance given by a DFA. Indeed, this can be
achieved using the classical powerset construction for
determinization.

Hardness: We use a reduction from the Universality
problem. Let $L \subseteq \{0, 1\}^*$ be an instance of Universality
given by an NFA or a regular expression. Consider
$L' = (0 + 1)^* a^* b a^* + L a^*$ over the alphabet {0, 1, a, b}.
Our reduction associates L' to L and keeps the same
representation (NFA or regular expression).

If $L = \{0, 1\}^*$, then $L' = (0 + 1)^* a^* (b a^* + \varepsilon)$ and thus
$L' \in trC$. Conversely, assume $L \neq \{0, 1\}^*$. Let $w \in
\{0, 1\}^* \setminus L$. Then, for every M, $w a^M b a^M \in L'$ and
$w a^M a^M \notin L'$. Thus $L' \notin trC$. □
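The witness used in the hardness argument can be checked mechanically on small instances. The Python sketch below assumes that L is given as a pattern for the standard `re` module over the letters 0 and 1; the function names are hypothetical.

```python
import re


def reduced_language(l_pattern):
    """Pattern for L' = (0+1)* a* b a* + L a*."""
    return re.compile("(?:[01]*a*ba*)|(?:(?:" + l_pattern + ")a*)")


def witness_against_trc(l_pattern, w, m):
    """True iff w a^m b a^m is in L' while w a^m a^m is not."""
    lp = reduced_language(l_pattern)
    with_b = w + "a" * m + "b" + "a" * m
    without_b = w + "a" * (2 * m)
    return bool(lp.fullmatch(with_b)) and not lp.fullmatch(without_b)


# L = 0* is not universal: the word 1 is missing, and it yields a witness.
print(witness_against_trc("0*", "1", 3))
# L = (0+1)* is universal: no word can serve as a witness.
print(any(witness_against_trc("[01]*", w, 2) for w in ["", "0", "1", "01"]))
```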

Theorem 4. A language L belongs to trC iff L is
recognized by a regular expression in $\Psi_{tr}$.

We prove each direction separately in the next two
lemmas.

Lemma 18. Every language L in trC can be represented
by a regular expression in the above fragment.

Proof. We next outline an algorithm to build the
regular expression e from $A_L$. Let $C_1, \ldots, C_l$ be the
strongly connected components of $A_L$ in some topological
order. For every $k \in \{0, \ldots, l\}$ and every sequence $1 \le
j_1 < \cdots < j_k \le l$, we denote by $L[j_1, \ldots, j_k]$ the set of
all words from L that stay for at least $4M^2$ steps in each
component $C_{j_1}, \ldots, C_{j_k}$, and stay for at most $4M^2 - 1$
steps in the other components.

Clearly, L is the union of all $L[j_1, \ldots, j_k]$ over all
sequences $j_1, \ldots, j_k$. We next show how to build an
expression for $L[j_1, \ldots, j_k]$. We denote by $S_1, \ldots, S_k$
the components $C_{j_1}, \ldots, C_{j_k}$ and by $\Sigma_1, \ldots, \Sigma_k$ their
alphabets. For any component $S_i$ and state q in $S_i$, we
can easily build an expression H(q) for the following
language: $H(q) = \{w \in (\Sigma_i)^{2M^2} \mid \exists q_0 \in S_i, \Delta(q_0, w) =
q\}$. The rationale for this definition is that when a word
$w \in (\Sigma_i)^{M} \Sigma_i^*$ is matched from any state of $S_i$, the final
state in $S_i$ after matching w is determined by the last
$M^2$ letters of w, according to Lemma 10; in particular
for all $q, q_1 \in S_i$ and $w \in H(q)$, we have $\Delta(q_1, w) = q$.

Let i < k and $q \in S_i$. We build an expression W(q)
for the set of all words w that lead from q to some state
of $S_{i+1}$ while respecting the sequence of components.
In other words, a word $w = a_1 \ldots a_m$ belongs to W(q)
iff when we denote by $q_j$ the state $\Delta(q, a_1 \ldots a_j)$, the
sequence $q_1 \ldots q_m$ satisfies the following properties (see the footnote below):

• $q_1 \notin S_i$

• $q_m$ is the first state of the sequence that belongs to
$S_{i+1}$

• there are at most $4M^2 - 1$ states $q_j$ in a same
component of $A_L$.

Footnote: We require somewhat arbitrarily that the first letter of w
leaves $S_i$, while the last letter of w enters $S_{i+1}$ (i.e., is
not in $\Sigma_{i+1}$).

W(q) is a finite set of words having length at most $4M^3$.

Similarly, for i = k, we build for any state $q \in S_k$
an expression W(q) for the set of all words w that lead
from q to some final state while respecting the sequence
of components, i.e., satisfying conditions similar to the
above ones except that $q_m$ belongs to $F_L$ instead of
$S_{i+1}$. W(q) is a finite set of words having length at
most $4M^3$.

If $i_L$ belongs to $S_1$, we define the expression $e_{init}$ as $\varepsilon$,
otherwise $e_{init}$ is the set of all words that lead from $i_L$ to
some state in $S_1$ while respecting the sequence of
components. Rephrased differently, a word $w = a_1 \ldots a_m$
belongs to $e_{init}$ iff when we denote by $q_j$ the state
$\Delta(i_L, a_1 \ldots a_j)$, the sequence $q_0 \ldots q_m$ satisfies the
following properties:

• $q_0 = i_L$

• $q_m$ is the first state of the sequence that belongs to
$S_1$,

• there are at most $4M^2 - 1$ states $q_j$ in the same
component of $A_L$.

$e_{init}$ is a finite set of words having length at most $4M^3$.
Claim 1: The expression $e'_0$ defined by the following
equations represents the language $L[j_1, \ldots, j_k]$.

$e'_k = (\Sigma_k)^{\geq 2M^2} \cdot \big(\bigcup_{q \in S_k} H(q) \cdot W(q)\big)$

$e'_i = (\Sigma_i)^{\geq 2M^2} \cdot \big(\bigcup_{q \in S_i} H(q) \cdot W(q)\big) \cdot e'_{i+1}$ for all $1 \le i < k$

$e'_0 = e_{init} \cdot (\Sigma_1)^{\geq 2M^2} \cdot \big(\bigcup_{q \in S_1} H(q) \cdot W(q)\big) \cdot e'_1$

The language of $e'_0$ clearly contains $L[j_1, \ldots, j_k]$. The
converse follows from the above remark, based on
Lemma 10, which concludes the proof of Claim 1. We
now define the expressions $e_0, \ldots, e_k$ recursively as
follows (with i ranging from 1 to k in the second equation):

$e_k = ((\Sigma_k)^{\geq 2M^2} + \varepsilon) \cdot \big(\bigcup_{q \in S_k} H(q) \cdot W(q)\big)$

$e_i = ((\Sigma_i)^{\geq 2M^2} + \varepsilon) \cdot \big(\bigcup_{q \in S_i} H(q) \cdot W(q) + \varepsilon\big) \cdot e_{i+1}$

$e_0 = e_{init} \cdot ((\Sigma_1)^{\geq 2M^2} + \varepsilon) \cdot \big(\bigcup_{q \in S_1} H(q) \cdot W(q) + \varepsilon\big) \cdot e_1$

Claim 2: The language of $e_0$ contains $L[j_1, \ldots, j_k]$ and
is contained in L.

The language of $e_0$ clearly contains the language of $e'_0$,
hence $L[j_1, \ldots, j_k]$ by Claim 1. Let $w \in L(e_0)$. There
exist $u_0, v_0, u_1, v_1, \ldots, u_n$, and $v_n$ such that

• $w = u_0 v_0 u_1 v_1 \ldots u_n v_n$

• u G L(e init • ((S!)^ 2M2 



• for each < i < n-1, v t G £(U 9e s s -H"(?)-W(g)- 

• for each 1 < i < n, u, G L((E, i )- 2A:f2 + e) 

Let w' be the word obtained from w by replacing every
$v_i$ equal to $\varepsilon$ with an arbitrary word from $L(\bigcup_{q \in S_i} H(q) \cdot W(q))$,
and every $u_i$ equal to $\varepsilon$ with an arbitrary word
from $L((\Sigma_i)^{\geq 2M^2})$. Then w' belongs to $L(e'_0)$ and in
particular to L. Consequently, w also belongs to L
by repeated applications of Lemma 11. As W(q) and
H(q) are finite sets of words for every state q, $e_0$
belongs to the fragment, which concludes the proof of the
lemma. □

Lemma 19. Let L be a language recognized by a regular
expression $\varphi$ in $\Psi_{tr}$. Then $L \in trC$.

Proof. Since trC is closed by union, we assume that
$\varphi$ is of the form $\varphi_1 \cdot \ldots \cdot \varphi_l$ where $\varphi_1$ and $\varphi_l$ are words
and $\varphi_i$ are $\Psi_{tr}$-terms for every $i \in [2, l-1]$. For each
$i \in [l]$, we denote by $L_i$ the language recognized by $\varphi_i$.
Let M be the size of $\varphi$, i.e. the number of symbols that
compose $\varphi$. Let $u, v, w, w_1, w_2$ be words with $w_1$ and $w_2$
non-empty such that $u w_1^M v w_2^M w \in L$. It can be easily
seen that there is some term ipi of the form A— n + e 
such that wi G A*, uwf 1 G L\...Li and w^vw^w G 
A-' 1 ■ i^+i ■ ■ ■ L[. Similarly, there is some term ^ , j > i 
of the form I?-" 1 + e such that w 2 £ B* , uw^vwf 1 G 
Li . . . Lj and w^w G Lj . . .Li. Thus, uw^w^w G 
L\... Li-i ■ A n A* ■ Lj . . . Li C L. Indeed, ii i = j then 



A n A*Lj C and if « < j then A-Lj C .LjZ, 



□ 



Theorem 7. k-RSPQ is FPT. More precisely, the
problem is solvable in time $O(2^{O(k)} \cdot |A_L| \cdot |G| \cdot \log|G|)$.

Let V be a finite set. A k-coloring of V is a function
$c : V \to [k]$. A set $S \subseteq V$ is colorful for c if $c(x) =
c(y) \Rightarrow x = y$ for every $x, y \in S$. The crux of our proof
is the following result by Alon et al.:

Theorem 9 ([2]). Given k, n > 0 and a set V of n
elements, one can compute in time $O(2^{O(k)} |V| \log |V|)$ a
set of $l \in O(2^{O(k)} \log |V|)$ k-coloring functions $c_1, \ldots, c_l$
such that every set S of V of size k is colorful for at
least one $c_i$ ($i \in [l]$).

Proof. Let $G, A_L, k$ be an instance of k-RSPQ. We
compute l k-coloring functions $c_1, \ldots, c_l$ as stated in
Theorem 9. Let c be one of these functions. We will show how
to decide if there is a colorful L-labeled path from x
to y in (G, c). To this purpose, we define a function
$f : V \times Q_L \times \mathcal{P}([k]) \to \{0, 1\}$ such that $f(v, q, S) = 1$ if
there exists a colorful path p from x to v that uses
only colors of S and such that $q \in \Delta_L(i_L, w)$ where w
is the label of p. Clearly there is a colorful L-labeled path from
x to y if there is a set $S \subseteq [k]$ and a final state $q \in F_L$
such that $f(y, q, S) = 1$.

The function can be computed by dynamic
programming using the following equations.

• $f(x, i_L, \{c(x)\}) = 1$;

• $f(v, q, S) = 1$ if there is a subset $S' \subseteq S$ such that
$f(v, q, S') = 1$;

• $f(v, q, S) = 1$ if $c(v) \in S$ and there is a vertex v', a
state q' and a label a such that $f(v', q', S \setminus \{c(v)\}) = 1$,
$(v', a, v) \in E$ and $q \in \Delta_L(q', a)$;

• $f(v, q, S) = 0$ otherwise.

This function can be computed in time $O(2^k \cdot |A_L| \cdot
|G|)$. We compute f for every function $c_i$, $i \in [l]$, where
$l \in O(2^{O(k)} \log |V|)$. Consequently, k-RSPQ can be
solved in time $O(2^{O(k)} |A_L| \cdot |G| \cdot \log |G|)$. □
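The dynamic program can be sketched in Python as a search over triples (vertex, state, set of colors used). The sketch deviates from the proof in two labeled ways: it tracks the exact set of used colors instead of the monotone function f, and it replaces the derandomized family of colorings of Theorem 9 by random colorings, which gives a one-sided Monte Carlo test (a "False" answer may be wrong, a "True" answer never is). It also uses k+1 colors, assuming that the size of a path counts its edges. All function names are hypothetical.

```python
import random
from collections import deque


def colorful_path_exists(edges, delta, initial, finals, x, y, color):
    """Is there an L-labeled path from x to y whose vertices have distinct colors?

    edges: iterable of (u, label, v); delta: dict (state, label) -> set of
    states of an NFA; color: dict vertex -> color.
    """
    succ = {}
    for u, a, v in edges:
        succ.setdefault(u, []).append((a, v))
    start = (x, initial, frozenset([color[x]]))
    seen = {start}
    queue = deque([start])
    while queue:
        v, q, used = queue.popleft()
        if v == y and q in finals:
            return True
        for a, v2 in succ.get(v, []):
            if color[v2] in used:
                continue  # the color is already taken on this path
            for q2 in delta.get((q, a), ()):
                nxt = (v2, q2, used | {color[v2]})
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


def k_rspq_random(edges, vertices, delta, initial, finals, x, y, k,
                  trials=200, seed=0):
    """Monte Carlo variant: try random colorings with k + 1 colors."""
    rng = random.Random(seed)
    for _ in range(trials):
        color = {v: rng.randrange(k + 1) for v in vertices}
        if colorful_path_exists(edges, delta, initial, finals, x, y, color):
            return True
    return False


even_a = {(0, "a"): {1}, (1, "a"): {0}}  # NFA for (aa)*
path = [("x", "a", "u"), ("u", "a", "y")]
print(colorful_path_exists(path, even_a, 0, {0}, "x", "y", {"x": 0, "u": 1, "y": 2}))
```

A colorful path is necessarily simple and has at most k+1 vertices, which is why a positive answer is always sound.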

Theorem 8. Let k > 0 and $\mathcal{G}$ be a class of db-graphs of
directed treewidth at most k. Then, RSPQ(Reg, $\mathcal{G}$) is
polynomial, where Reg denotes the regular languages.

Proof sketch. The proof is a straightforward
adaptation of the proof proposed in [23] for the Hamiltonian
Path problem. Since they use a dynamic programming approach, they
consider a more general problem: given a digraph G and
a sequence of k tuples $(v_i, n_i, v'_i)_{i \in [k]}$, are there k disjoint
simple paths $p_1, \ldots, p_k$ such that $p_i$ is a path of size $n_i$
from $v_i$ to $v'_i$ for every $i \in [k]$?

We extend the problem as follows: given a db-graph
G, a regular language L and a sequence of k tuples
$(v_i, n_i, v'_i, q_i, q'_i)_{i \in [k]}$, are there k words $w_1, \ldots, w_k$ and
k disjoint simple paths $p_1, \ldots, p_k$ such that $p_i$ is a
$w_i$-labeled path of size $n_i$ from $v_i$ to $v'_i$ and $\Delta_L(q_i, w_i) = q'_i$
for every $i \in [k]$? Their proof can easily be
adapted to this new problem. □